Abstract: We investigate the VC-dimension of the perceptron and simple two-layer networks like the committee- and the parity-machine with weights restricted to values $\pm 1$. For binary inputs, the VC-dimension is determined by atypical pattern sets, i.e. it cannot be found by replica analysis or numerical Monte Carlo sampling. For small systems, exhaustive enumerations yield exact results. For systems that are too large for enumerations, number theoretic arguments give lower bounds for the VC-dimension. For the Ising perceptron, the VC-dimension is probably larger than $N/2$.

1 Introduction

Presently investigations in different fields including mathematical statistics, computer science and statistical mechanics aim at a deeper understanding of information processing in artificial neural networks. Every field has developed its own concepts which, although related to each other, are naturally not identical. In order to use (and appreciate) progress made in another field it is hence important to know the different concepts and their mutual relation. The Vapnik-Chervonenkis (VC) dimension is one of the central quantities used in both mathematical statistics and computer science to characterize the performance of classifier systems [1], [...]. In the case of feed-forward neural networks it establishes connections between the storage and generalization abilities of these systems [...], [...]. Unfortunately, for most architectures the precise value of the VC-dimension is not known and only bounds exist [...].

The VC-dimension was introduced to characterize certain 
extreme situations in machine learning. It is therefore very 
useful to derive bounds for the network performance by con- 
sidering the worst possible case. Complementary investiga- 
tions in statistical mechanics focus on the typical behaviour 
described by appropriate averages. In simple situations as provided, e.g., by the spherical perceptron it turns out that the typical and worst case behaviour are not dramatically different [...]. It is then comparatively easy to establish connections between results obtained in different fields.

In the present paper we discuss some peculiarities that are encountered when analysing the VC-dimension of neural networks with binary weights. Binary weights are the extreme case of discrete couplings with obvious advantages in biological and technical implementations. It turns out, however, that in this case the typical and the extreme behaviour of the network can be rather different. Therefore the relation between results obtained by different approaches is less obvious.

Let us also note that in the mathematical literature binary weights are usually assumed to take on the values 0 and 1. Physically minded people on the other hand prefer the values $-1$ and $1$, reminiscent of spin systems. As we will show, these two choices make a difference.

The paper is organized as follows: After giving the basic definitions in the next section we discuss some simple examples in section 3. In section 4 we give a short discussion of analytical methods using the replica trick to calculate the behaviour of the typical growth function. Section 5 is devoted to the numerical investigation of the typical growth function for the binary perceptron and simple two-layer networks. In section 6 we derive bounds for the VC-dimension of neural networks with binary couplings including simple multilayer systems. The bounds show that the VC-dimension is determined by atypical situations. The VC-dimension can hence not be inferred from the properties of the typical growth function. We give arguments that the value of the VC-dimension for networks with binary weights may depend on whether the input vectors are continuous, binary $(0, 1)$ or binary $(-1, 1)$. Finally section 7 contains our conclusions.



2 Basic definitions 

The VC-dimension $d_{VC}$ is defined via the growth function $\Delta(p)$. Consider a set $X$ of instances $x$ and a set $C$ of (binary) classifications $c: x \to \{-1, 1\}$ that group all $x \in X$ into two classes labeled by $1$ and $-1$ respectively. In the case of feed-forward neural networks [...] with $N$ input units and one output unit, $X$ is the space of all possible input vectors $\xi \in \mathbb{R}^N$ or $\xi \in \{-1, +1\}^N$, the class is defined by the binary output $\sigma = \pm 1$ and $C$ comprises all mappings that can be realized by different choices of the couplings $J$ and thresholds of the network. For any set $\{x^\mu\}$ of $p$ different inputs $x^1, \dots, x^p$ we determine the number $\Delta(x^1, \dots, x^p)$ of different output vectors $\{\sigma_1, \dots, \sigma_p\}$ that can be induced by using all the possible classifications $c \in C$. A pattern set is called shattered by the class $C$ of classifications if $\Delta(x^1, \dots, x^p)$ equals $2^p$, the maximal possible number of different binary classifications of $p$ inputs. Large values of $\Delta(x^1, \dots, x^p)$ hence roughly correspond to a large diversity of mappings contained in the class $C$. The growth function $\Delta(p)$ is now defined by



$$\Delta(p) = \max_{x^1, \dots, x^p} \Delta(x^1, \dots, x^p). \tag{1}$$

It is clear that $\Delta(p)$ cannot decrease with $p$. Moreover for small $p$ one expects that there is at least one shattered set of size $p$ and hence $\Delta(p) = 2^p$. On the other hand this exponential increase of the growth function is unlikely to continue for all $p$. The value of $p$ where it starts to slow down should give a hint on the complexity of the class $C$ of binary classifications. In fact the Sauer lemma [1], [...] states that for all classes $C$ of binary classifications there exists a natural number $d_{VC}$ (which may be infinite) such that

$$\Delta(p) \begin{cases} = 2^p & \text{if } p \le d_{VC} \\ \le \sum_{i=0}^{d_{VC}} \binom{p}{i} & \text{if } p \ge d_{VC} \end{cases} \tag{2}$$

$d_{VC}$ is called the VC-dimension of class $C$. Note that it will in general depend on the set $X$ of instances to be classified. Hence in the case of neural networks there can be different values of $d_{VC}$ for the same class of networks depending on whether the input patterns are real or binary vectors.
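The definitions above can be made concrete with a brute-force count. The following is a minimal sketch, assuming a threshold-free classifier of the form sign(J·ξ) with J in {-1,+1}^N and N odd; the function names and the three example patterns are illustrative choices and do not come from the paper.

```python
from itertools import product
from math import comb


def ising_dichotomies(patterns):
    """Return the set of output vectors realizable by sign(J . xi), J in {-1,+1}^N.

    Assumes N odd and +-1 patterns, so that the field J . xi never vanishes.
    """
    n = len(patterns[0])
    outputs = set()
    for J in product((-1, 1), repeat=n):
        fields = [sum(j * x for j, x in zip(J, xi)) for xi in patterns]
        outputs.add(tuple(1 if h > 0 else -1 for h in fields))
    return outputs


def is_shattered(patterns):
    """A set of p patterns is shattered if all 2^p output vectors occur."""
    return len(ising_dichotomies(patterns)) == 2 ** len(patterns)


def sauer_bound(p, d_vc):
    """Right-hand side of the Sauer bound: sum_{i=0}^{d_vc} binom(p, i)."""
    return sum(comb(p, i) for i in range(min(p, d_vc) + 1))


# Illustrative pattern set for N = 3 (hypothetical example, not from the paper).
patterns = [(1, 1, 1), (1, -1, 1), (1, 1, -1)]
print(len(ising_dichotomies(patterns)), is_shattered(patterns))
print(sauer_bound(4, 3))
```

With only 2^N weight vectors available, no set of more than N patterns can be shattered by this class, which the count makes directly visible for small N.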

Due to the max in eq. (1) it is possible that the VC-dimension is determined by a single very special pattern set. In many situations emphasis is, however, on the typical properties of the system. In order to characterize the typical storage and generalization abilities of a neural network a probability measure $\mathcal{P}$ on the input set $X$ is introduced. One then asks for the properties of the typical growth function $\Delta^{typ}(p)$ which at variance with eq. (1) is defined as the most probable value of $\Delta(x^1, \dots, x^p)$ with respect to the measure $\mathcal{P}$. In the relevant limit of large dimension $N$ of the input space it is generally assumed that the distribution of $\Delta(x^1, \dots, x^p)$ is sharply peaked around this value. In the same limit $N \to \infty$ methods from statistical mechanics can be used to investigate the properties of $\Delta^{typ}(p)$. This limit is non-trivial if $\alpha = p/N = O(1)$ and results in $d_{VC} = O(N)$. We will call $\alpha_{VC} = \lim_{N\to\infty} d_{VC}/N$ the VC-capacity of the neural network [10]. In addition we may define $d_{VC}^{typ}$ as the value of $p$ at which $\Delta^{typ}(p)$ starts to deviate from $2^p$ and $\alpha_{VC}^{typ} = \lim_{N\to\infty} d_{VC}^{typ}/N$. The storage threshold $p_c$ is as usual defined by $\Delta^{typ}(p_c)/2^{p_c} = 1/2$ and $\alpha_c = \lim_{N\to\infty} p_c/N$ is the storage capacity.

Using Stirling's formula in eq. (2) and replacing the sum by an integral one can show that for large $N$ the relative deviation of the upper bound from $2^p$ becomes $O(1)$ if $\alpha > 2\alpha_{VC}$ (see section 4). Since we always have $\Delta^{typ}(\alpha) \le \Delta(\alpha)$ this implies

$$\alpha_c \le 2\alpha_{VC}. \tag{3}$$
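The role of the factor 2 can be checked numerically without any asymptotics. The sketch below evaluates the Sauer upper bound divided by $2^p$ exactly with integer arithmetic; the function name and the chosen values of d are illustrative assumptions.

```python
from math import comb


def sauer_ratio(p, d_vc):
    """Sauer upper bound sum_{i<=d_vc} binom(p, i), divided by 2^p."""
    return sum(comb(p, i) for i in range(min(p, d_vc) + 1)) / 2 ** p


# At p = 2 d_vc the bound still allows more than half of all 2^p output
# vectors; for p = 3 d_vc the allowed fraction is already tiny.
for d in (10, 100, 1000):
    print(d, sauer_ratio(2 * d, d), sauer_ratio(3 * d, d))
```

At p = 2 d the ratio equals 1/2 plus half the weight of the central binomial term, so the bound only forces a fraction of realizable outputs below 1/2 for p larger than 2 d, in line with inequality (3).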



In this paper we concentrate on three sets of classifiers: the Ising perceptron, the Ising committee tree and the Ising parity tree. The Ising perceptron realizes the classification $\xi \mapsto \pm 1$ via

$$\sigma = \mathrm{sign}\Big(\sum_{i=1}^{N} J_i \xi_i - \theta\Big) \tag{4}$$



with weight vector $J \in \{\pm 1\}^N$. The perceptron is a prototype of what is usually termed single layer feed forward networks: The $N$ input values are summed up and the resulting "field" is passed through a nonlinear function to yield a single output value $\sigma$. The computational capabilities of single layer feed forward networks are rather limited. Hence one is interested in multi-layer networks, where the output of single layer networks is used as input for another single layer network. The Ising committee-machine and the Ising parity-machine are examples of 2-layer networks. In both machines, the input values $\xi_i$ are mapped to $K$ binary values $\tau_k$ by $K$ Ising perceptrons. The $\tau_k$ are called the internal representation of the input. The internal representation is mapped onto the final output $\sigma = \pm 1$ by the so-called decoder function in the output layer. The decoder function is different in both machines: The committee-machine uses a perceptron with all weights $+1$,

$$\sigma = \mathrm{sign}\Big(\sum_{k=1}^{K} \tau_k\Big) = \mathrm{sign}\Big(\sum_{k=1}^{K} \mathrm{sign}\Big(\sum_{i=1}^{N} J_i^{(k)} \xi_i\Big)\Big), \tag{5}$$



where $J^{(k)}$ is the weight vector of the perceptron that "feeds" the $k$-th hidden node. The restriction to all weights $+1$ in the output perceptron is not as severe as it may appear: The storage properties of this architecture are the same as for a machine where the output perceptron is an arbitrary Ising perceptron (see appendix [...]).

The parity machine simply takes the parity of the internal representation:

$$\sigma = \prod_{k=1}^{K} \tau_k = \prod_{k=1}^{K} \mathrm{sign}\Big(\sum_{i=1}^{N} J_i^{(k)} \xi_i\Big). \tag{6}$$




Figure 1: Feedforward 2-layer network with tree structure 
and 9 input nodes, 3 hidden nodes and 1 output node. 

In general, a hidden node can receive input from all input nodes. In this case we have $NK$ weights to specify. If the input nodes are distributed among the hidden nodes such that no input node feeds more than one hidden node, the net has a tree structure (see fig. 1). For simplicity we will assume that the input nodes are distributed evenly among the hidden nodes, i.e. each subperceptron has $N/K$ weights.

3 Some simple examples 

To begin with let us discuss some simple examples. In the case of the spherical perceptron defined by eq. (4) but now with $J \in \mathbb{R}^N$, $\sum_j J_j^2 = N$, the exact results $d_{VC} = N + 1$ and $\alpha_c = 2$ have been obtained analytically [11]. Moreover it is well known that the number of different realizable output vectors (dichotomies) is the same for all input pattern sets in general position [...]. Hence the max in eq. (1) is realized by almost all possible input sets of length $p$ and $\Delta^{typ}(\alpha) = \Delta(\alpha)$. Furthermore (3) is satisfied as equality.

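A particularly simple pattern set for which the result for the VC-dimension can easily be verified is given by

$$\begin{aligned} \xi^0 &= (0, 0, 0, \dots, 0) \\ \xi^1 &= (1, 0, 0, \dots, 0) \\ \xi^2 &= (0, 1, 0, \dots, 0) \\ &\;\;\vdots \\ \xi^N &= (0, 0, \dots, 0, 1) \end{aligned} \tag{7}$$

An arbitrary output vector $(\sigma_0, \sigma_1, \dots, \sigma_N)$ can be realized for these inputs by using $J_j = \sigma_j$ and $\theta = -\sigma_0/2$.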

Another interesting example is provided by a perceptron for which the couplings are constrained to take the values $J_j \in \{0, 1\}$ only. Using the set of input patterns described above but omitting $\xi^0$, an arbitrary output string $(\sigma_1, \dots, \sigma_N)$ can be realized by using $J_j = (1 + \sigma_j)/2$ and $\theta = 1/2$. Therefore $N \le d_{VC} \le N + 1$. On the other hand it is known that the storage capacity of this perceptron is given by $\alpha_c = 0.59$ [...]. This large difference between $\alpha_c$ and $2\alpha_{VC}$ is due to the fact that the VC-dimension is determined by a very special pattern set and that $\Delta^{typ}(p)$ is much smaller than $\Delta(p)$. Hence the number of realizable output vectors is no longer the same for all input vectors in general position.

Finally we consider the so-called Ising-perceptron, again described by eq. (4) but now with the constraint $J_j = \pm 1$ on the couplings. Since the couplings used above to show that the pattern set (7) is shattered by a spherical perceptron fulfill this constraint it is clear that the VC-dimension of the Ising-perceptron is for patterns $\xi \in \mathbb{R}^N$ equal to $N + 1$, exactly as for the spherical perceptron. For $\theta = 0$ we get $d_{VC} = N$ in both cases.

For binary input patterns $\xi_i^\mu = \pm 1$ we transform the pattern set (7) according to $\xi \to 2\xi - 1$. Every output vector $(\sigma_0, \sigma_1, \dots, \sigma_N)$ can then be realized by using $J_j = \sigma_j$ for $j = 1, \dots, N$ and $\theta = -\sigma_0 - \sum_j J_j$. Therefore the VC-dimension is again $d_{VC} = N + 1$. However, since much of the interest in neural networks with discrete weights is due to their easy technical implementation it is not consistent to design an Ising-perceptron with a threshold of order $N$. More interesting is the determination of the VC-dimension of the Ising-perceptron without (for $N$ odd) or with a binary threshold $\theta = \pm 1$ (for $N$ even) for binary patterns. This is a hard problem (see section 6).
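The two explicit constructions of this section can be checked by brute force for small N. The sketch below assumes the classifier sign(J·ξ - θ); for the ±1 patterns it assumes the threshold θ = -σ_0 - Σ_j J_j, which is our reading of a formula that is garbled in this copy. All function names are illustrative.

```python
from itertools import product


def sign(h):
    """Sign of a non-zero field; a vanishing field is treated as an error."""
    if h == 0:
        raise ValueError("field vanishes, output undefined")
    return 1 if h > 0 else -1


def unit_patterns(n):
    """Pattern set (7): the zero vector followed by the n unit vectors."""
    zero = [tuple(0 for _ in range(n))]
    units = [tuple(1 if i == j else 0 for i in range(n)) for j in range(n)]
    return zero + units


def realizes(patterns, J, theta, target):
    """True if sign(J . xi - theta) reproduces the target output vector."""
    return all(
        sign(sum(j * x for j, x in zip(J, xi)) - theta) == s
        for xi, s in zip(patterns, target)
    )


n = 5
pats01 = unit_patterns(n)                                   # entries in {0, 1}
pats_pm = [tuple(2 * x - 1 for x in xi) for xi in pats01]   # entries in {-1, +1}

ok01 = ok_pm = True
for target in product((-1, 1), repeat=n + 1):
    s0, J = target[0], target[1:]                           # J_j = sigma_j
    ok01 = ok01 and realizes(pats01, J, -s0 / 2, target)
    ok_pm = ok_pm and realizes(pats_pm, J, -s0 - sum(J), target)

print(ok01, ok_pm)
```

Both flags come out true, i.e. every one of the 2^(N+1) output vectors is realized; note that the threshold needed for the ±1 patterns can be as large as N + 1 in magnitude.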

We note that the storage capacity of the Ising perceptron has been shown to be $\alpha_c = 0.83$ [13]. Hence also in this case we have $\alpha_c < 2\alpha_{VC}$ and the VC-dimension is not determined by typical pattern sets. We also note that the storage capacity is believed to be the same for binary and Gaussian patterns [13], [14], [15]. As we will see in section 6, it is unlikely that this holds also for the VC-dimension.

4 Analytical methods 

Let us fix a particular set $\{\xi^1, \dots, \xi^p\}$ of input patterns fed into a neural network with parameters $J$. Different values of the parameters will result in different output strings $\{\sigma^1, \dots, \sigma^p\}$ and hence the input patterns induce a partition of the parameter space into different cells labeled by the realized output sequences $\{\sigma^\mu\}$. The cells have a certain volume $V(\{\sigma^\mu\})$ which might be zero if the output string $\{\sigma^\mu\}$ cannot be realized. An interesting quantity is the number of cells of a given size

$$\mathcal{N}(V) = \mathrm{Tr}_{\{\sigma^\mu\}}\, \delta\big(V - V(\{\sigma^\mu\})\big) \tag{8}$$

which, of course, still depends on the particular set of input patterns $\{\xi^\mu\}$. It is possible to calculate the typical value of $\mathcal{N}(V)$ for randomly chosen $\{\xi^\mu\}$ using multifractal methods and an interesting variant of the replica trick [16]. This calculation has been explicitly performed for both the spherical and the Ising perceptron [17], [18] and we give in this section a brief summary of the results relevant for the present paper. For the perceptron (4) (with $\theta = 0$ for simplicity) we have

$$V(\{\sigma^\mu\}) = \int d\mu(J) \prod_{\mu=1}^{p} \theta(\sigma^\mu J \xi^\mu) \tag{9}$$

where $\theta(\cdot)$ under the product denotes the step function, $\int d\mu(J) = (2\pi e)^{-N/2} \int \prod_j dJ_j\, \delta(\sum_j J_j^2 - N)$ for the spherical perceptron and $\int d\mu(J) = 2^{-N} \sum_{J_j = \pm 1}$ for the Ising case. The natural scale of $V$ for $N \to \infty$ is then $2^{-N}$ and it is convenient to introduce $k(\{\sigma^\mu\}) = -\frac{1}{N} \log_2 V(\{\sigma^\mu\})$ as a measure for the size of the cells. Similarly the number of cells is exponential in $N$ and we therefore use

$$c(k) = \frac{1}{N} \log_2 \mathcal{N}(k) = \frac{1}{N} \log_2 \mathrm{Tr}_{\{\sigma^\mu\}}\, \delta\big(k - k(\{\sigma^\mu\})\big) \tag{10}$$

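to characterize the cell size distribution. Realizing that $c(k)$ is the microcanonical entropy of the spin system $\{\sigma^\mu\}$ with hamiltonian $N k(\{\sigma^\mu\})$ it can be calculated from the free energy

$$f(\beta) = -\frac{1}{\beta N} \log_2 \mathrm{Tr}_{\{\sigma^\mu\}}\, 2^{-\beta N k(\{\sigma^\mu\})} \tag{11}$$

via Legendre transform

$$c(k) = \min_\beta \left[\beta k - \beta f(\beta)\right]. \tag{12}$$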



S. Mertens, A. Engel

On the VC-dimension of neural networks with binary weights



From the experience with related systems ([...]) one expects $f$ (and therefore $c$) to be self-averaging with respect to the distribution of the input patterns $\{\xi^\mu\}$. The average of $f(\beta)$ over the inputs can be performed using the replica trick. Within a special replica symmetric ansatz the calculation of $f(\beta)$ can be reduced to a saddle-point integral over one (for the spherical) or two (for the Ising case) order parameters, which are evaluated numerically [17].



Figure 2: Distribution of cell sizes $c(k)$ in the coupling space of an Ising perceptron with loading ratios $\alpha = 0.2, 0.4, 0.833, 1.245$ (from left to right). Inside the region given by the diamonds replica symmetry holds. The dot marks the divergences of negative moments.

Figure 2 shows some of the resulting curves for the Ising case. For $\alpha = 0.2$ and $0.4$ the corresponding curves for the spherical perceptron are rather similar. The typical cell size is given by $k_0 = \mathrm{argmax}\, c(k)$. Therefore $V_0 = 2^{-N k_0}$ coincides with the typical phase space volume as calculated by a standard Gardner approach [...]. On the other hand $2^{N c(k_0)}$ gives, up to exponentially small contributions from other cell sizes, the typical total number $\Delta^{typ}$ of cells as determined for the spherical perceptron by Cover [11]. From the explicit formulae one can show that for the spherical perceptron $c(k_0) = \alpha$ as long as $k_0 < \infty$, i.e. $V_0 > 0$, and $c(k) < \alpha$ if $k_0 = \infty$, i.e. $V_0 = 0$.

For the Ising perceptron there is a smallest possible cell size $k_{max} = 1$ where only one coupling remains. Hence $\Delta^{typ} \sim 2^{N c(k_0)}$ if $k_0 < 1$ and $\Delta^{typ} \sim 2^{N c(1)}$ if $k_0 > 1$. The borderline $k_0 = 1$ is realized for $\alpha = 0.83$, the well known value of $\alpha_c$ [13]. The calculation of the curves $c(k)$ therefore establishes the connection between the two complementary approaches by Cover and Gardner to determine the storage capacity of neural nets.

Since one has direct access to the number of realizable output sequences it is tempting to use this approach also to calculate the VC-dimension analytically. Due to the averages over the input distribution necessary to accomplish the calculation we can at most hope to determine $\alpha_{VC}^{typ}$ in this way. As discussed above $\alpha_{VC}^{typ}$ will only coincide with $\alpha_{VC}$ if the maximum in eq. (1) is realized by a typical set of input patterns. To determine $\alpha_{VC}^{typ}$ we have to find the value of $\alpha$ at which the total number of cells $\Delta^{typ}$ starts to deviate from $2^{\alpha N}$. For $\alpha_{VC} < \alpha < 2\alpha_{VC}$ an asymptotic analysis of the bound in eq. (2) reveals that

[Eq. (13): only the factors $2^{\alpha N}$ and $\mathrm{erfc}$ are recoverable from this copy.] (13)

Hence it may happen that the relative deviation is exponentially small in $N$. In principle we are able to detect this deviation by using the whole function $c(k)$. However, for very small and very large $k$ the calculation of $c(k)$ necessitates replica symmetry breaking ([...]) which renders the calculation practically impossible.

But there is another way to get some information on $\alpha_{VC}^{typ}$ from the $c(k)$ curves. It is clear from eq. (11) that $f(\beta)$ will diverge for all $\beta < 0$ if some of the cells are empty, i.e. if $k(\{\sigma^\mu\}) = \infty$. For $\alpha < \alpha_{VC}^{typ}$ this is possible only if the patterns are linearly dependent. For Gaussian patterns the probability for this to happen is zero and therefore no divergence of $f$ for $\beta < 0$ will show up for $\alpha < \alpha_{VC}^{typ}$ (see footnote 1). For $\alpha > \alpha_{VC}^{typ}$, however, there are typically some empty cells and $f(\beta)$ should be divergent for all $\beta < 0$.

Within the replica symmetric approximation one finds this divergence of negative moments for both the spherical and the Ising perceptron at $\beta = (\alpha - 1)/\alpha$ if $\alpha < 1$ and $\beta = 0^-$ if $\alpha > 1$. This suggests $\alpha_{VC}^{typ} = 1$ for both cases. For the spherical perceptron this coincides with the known result. Moreover the point $\beta = 0$, $\alpha = 1$ belongs to the region of local stability of the replica symmetric saddle point. For the Ising case the result must be wrong since $\alpha_{VC}^{typ}$ cannot be larger than $\alpha_c \approx 0.83$. Since also in this case the replica symmetric saddle point is locally stable at $\beta = 0$, $\alpha = 1$ it is very likely that there is a discontinuous transition to replica symmetry breaking as typical for this system [13]. It remains to be seen whether a solution in one step replica symmetry breaking can provide a more realistic value of $\alpha_{VC}^{typ}$.

In principle it is possible using the same techniques to ob- 
tain expressions for the typical growth function of simple 
multi-layer nets. However, the technical problems will in- 
crease and replica symmetry breaking is again likely to show 
up. We just note that a related analysis, namely the char- 
acterization of the distribution of internal representations 
within the typical Gardner volume has recently been performed [21], [...] for the committee machine. From these
investigations a new result for the storage capacity in the limit 
of a large number of hidden units was obtained. 



1 For binary patterns the probability for two identical patterns is $2^{-N}$ and $f(\beta)$ should be divergent for $\beta < 0$ for all $\alpha$. This is, however, not found in the explicit calculation since by keeping only the first two moments of the pattern distribution in performing the ensemble average one effectively replaces the original distribution by a Gaussian one.

5 Typical growth functions

The typical growth function $\Delta^{typ}(p)$ of a classifier system that is parameterized by $N$ binary variables can be measured numerically by an algorithm that mixes Monte-Carlo methods and exact enumeration [...].

The enumeration is required to determine $\Delta(\xi^1, \dots, \xi^p)$, the number of different output vectors that are realizable for a given pattern set. To get this number, one has to calculate the output vectors of all $2^N$ classifiers! This exponential complexity limits the numerical calculations to small values of $N$.

To get $\Delta^{typ}(p)$, we draw $p$ random unbiased patterns $\xi^\mu \in \{\pm 1\}^N$ and calculate $\Delta(\xi^1, \dots, \xi^p)$. This is repeated again and again, and the values of $\Delta(\xi^1, \dots, \xi^p)$ are averaged to yield $\Delta^{typ}(p)$. The scale of $\Delta$ for large $N$ is $O(2^N)$ so we average the logarithm:

$$\log \Delta^{typ}(p) = \big\langle \log \Delta(\xi^1, \dots, \xi^p) \big\rangle_{\{\xi^\mu\}} \tag{14}$$
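The procedure just described can be sketched in a few lines of Python. This is an illustrative toy version for the threshold-free Ising perceptron with N odd; the function names, the sample size and the choice N = 7 are assumptions made here for speed and are far smaller than what a real measurement would use.

```python
import math
import random
from itertools import product


def count_outputs(patterns):
    """Exact enumeration: number of distinct output vectors over all 2^N weights."""
    n = len(patterns[0])
    seen = set()
    for J in product((-1, 1), repeat=n):
        # For N odd and +-1 patterns the field is never zero.
        seen.add(tuple(sum(j * x for j, x in zip(J, xi)) > 0 for xi in patterns))
    return len(seen)


def log2_typical_growth(n, p, samples, rng):
    """Monte Carlo average of log2 Delta over random unbiased +-1 pattern sets."""
    total = 0.0
    for _ in range(samples):
        patterns = [tuple(rng.choice((-1, 1)) for _ in range(n)) for _ in range(p)]
        total += math.log2(count_outputs(patterns))
    return total / samples


rng = random.Random(0)          # fixed seed, only for reproducibility of the sketch
n = 7
curve = {p: log2_typical_growth(n, p, samples=20, rng=rng) for p in range(1, 11)}
for p, value in curve.items():
    print(p, round(value, 3))
```

The averaged logarithm can never exceed min(p, N), since at most 2^p output vectors exist and at most 2^N weight vectors are available; the cost of each sample grows like 2^N, which is the limitation mentioned in the text.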
Figure 3: Typical growth function of the Ising perceptron with binary patterns averaged over 1000 samples. The function is of course only defined at discrete values of $p$, but the continuous lines ease the readability. The inset displays the values for $N = 27$ together with the error bars.

Figure 3 shows $\Delta^{typ}(p)$ for the Ising perceptron with binary patterns. The curves display the expected behavior: $\Delta^{typ}(p) = 2^p$ for small $p$ and $\Delta^{typ}(p) < 2^p$ for larger values of $p$. The transition between these two regimes seems to become sharper with increasing $N$, but it is not clear whether we get a true step function in the limit $N \to \infty$. The corresponding curves for the committee- and the parity-tree look similar.

As a test we derive the critical storage capacity $\alpha_c$ from fig. 3 by reading off the point where $\Delta^{typ}(p) = 2^{p-1}$. Figure 4 shows $\alpha_c$ vs. $1/N$ for the Ising perceptron and the committee- and parity-tree with $K = 3$ each. The extrapolations to $N = \infty$ are in good agreement with the analytical results $\alpha_c = 0.83$ for the Ising perceptron [13], $\alpha_c = 0.92$ for the Ising committee tree with $K = 3$ [25] and $\alpha_c = 1$ for the Ising parity tree with $K \ge 2$ [...].

Figure 4: Critical storage capacity $\alpha_c$ deduced from $\Delta^{typ}(p)$ for the Ising perceptron, the $K = 3$ Ising committee tree and the $K = 3$ Ising parity tree.

For the spherical perceptron $\Delta(\{\xi^\mu\})$ is known to be the same for all pattern sets in general position. The inset of fig. 3 displays that in the case of the Ising perceptron the average over the patterns introduces a statistical error that does not tend to zero with increasing number of samples. This implies that for the Ising perceptron the number of realizable output sequences is not the same for all pattern sets in general position.

The typical VC-dimension $d_{VC}^{typ}$ can in principle be obtained from $\Delta^{typ}(p)$ as the number of patterns for which $\Delta^{typ}(p)$ starts to deviate from $2^p$. Due to the statistical errors in $\Delta^{typ}(p)$, a separate evaluation of $d_{VC}^{typ}$ is more appropriate. For this, we calculate $\Delta(\xi^1, \dots, \xi^p)$ for a random set of patterns. If it equals $2^p$, the set is enlarged by another random pattern and $\Delta(\{\xi^\mu\})$ is calculated again. This step is repeated until the set is no longer shattered. The number of patterns in the set (minus 1) gives a value for $d_{VC}^{typ}$. These values are averaged over many random samples. The results are shown in fig. 5. The dependence of $d_{VC}^{typ}$ on $N$ is roughly given by

Ising perceptron: $d_{VC}^{typ}(N) \propto [...]\, N$
$K = 3$ committee-tree: $d_{VC}^{typ}(N) \propto 0.6\, N$
$K = 3$ parity-tree: $d_{VC}^{typ}(N) \propto 0.88\, N$. (15)
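The incremental procedure can be sketched as follows for the threshold-free Ising perceptron. This is a toy illustration under assumptions made here (N = 7, 30 samples, hypothetical function names), not the setup used to produce fig. 5.

```python
import random
from itertools import product


def is_shattered(patterns):
    """True if all 2^p output vectors are realized by some J in {-1,+1}^N (N odd)."""
    n = len(patterns[0])
    seen = set()
    for J in product((-1, 1), repeat=n):
        seen.add(tuple(sum(j * x for j, x in zip(J, xi)) > 0 for xi in patterns))
    return len(seen) == 2 ** len(patterns)


def sample_typical_dvc(n, rng):
    """Add random +-1 patterns one by one until the set is no longer shattered.

    Returns the size of the last shattered set (number of patterns minus 1).
    The loop stops after at most n + 1 patterns because 2^(n+1) > 2^n.
    """
    patterns = []
    while True:
        patterns.append(tuple(rng.choice((-1, 1)) for _ in range(n)))
        if not is_shattered(patterns):
            return len(patterns) - 1


rng = random.Random(1)          # fixed seed, only for reproducibility of the sketch
n = 7
values = [sample_typical_dvc(n, rng) for _ in range(30)]
print(sum(values) / len(values))
```

Each sample lies between 1 and N: a single pattern is always shattered because J and -J give opposite outputs, and more than N patterns can never be shattered by 2^N weight vectors.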
Figure 5: Numerical values of $d_{VC}^{typ}(N)$ for the Ising perceptron, the $K = 3$ Ising committee tree and the $K = 3$ Ising parity tree. The straight lines between the points are guides to the eye.

6 Bounds for $d_{VC}$



The exact value of $d_{VC}$ for the Ising perceptron with binary or zero threshold and binary patterns is not known, not even in the limit $N \to \infty$. Only bounds can be provided.

An arbitrary set of classifiers which are parameterized by $N$ bits, like the Ising perceptron, cannot produce more than $2^N$ distinct output vectors on any set of input patterns. So we have

$$d_{VC}(N) \le N \tag{16}$$

as a general upper bound for "Ising-like" classifiers.

Finding good lower bounds is a bit more tedious. It can be achieved by explicit construction of shattered sets. In those cases, however, where $d_{VC}^{typ} \ll d_{VC}$, shattered sets with cardinality $> d_{VC}^{typ}$ are rare and consequently hard to find by random search.

6.1 Ising perceptron 

For the Ising perceptron it is shown in appendix [...] that $d_{VC}$ is the same for $N$ odd, zero threshold and $N - 1$, binary threshold. Therefore we can safely restrict ourselves to the case $N$ odd and no threshold.

In ref. [27] a special pattern set is given that yields

$$d_{VC}(N) \ge \tfrac{1}{2}(N + 3). \tag{17}$$

Shattered sets with cardinality $\tfrac{1}{2}(N + 3)$ are not too rare; they do show up in the statistical algorithm of section 5.

To get an improved lower bound for general values of $N$, we consider a restricted variant of the Ising perceptron, the balanced Ising perceptron where the couplings have minimum "magnetization":

$$\sum_{i=1}^{N} J_i = \pm 1. \tag{18}$$

The balanced Ising perceptrons are a subset of the usual Ising perceptrons, hence any pattern set that is shattered by the former is as well shattered by the latter.

Now let $\{\xi^1, \dots, \xi^p\}$ be a shattered set for the balanced Ising perceptron with $N$ nodes and let $\{\sigma_1, \dots, \sigma_p\}$ be an output vector that is realized by the balanced weight-vector $J$. Going from $N$ to $N + 2$, we define $p + 1$ patterns

[...] (19)

and new couplings

$$J^{\pm} = (+, J, -), \qquad J^{\mp} = (-, J, +). \tag{20}$$

These couplings preserve the output values of the "old" patterns

$$\mathrm{sign}(J^{\pm} \xi^\nu) = \mathrm{sign}(J^{\mp} \xi^\nu) = \sigma_\nu, \qquad 1 \le \nu \le p, \tag{21}$$

while the balance property ensures that both classifications of the new pattern can be realized:

$$\mathrm{sign}(J^{\pm} \xi^{p+1}) = \mathrm{sign}\Big(-2 - \sum_{i=1}^{N} J_i\Big) = -1, \qquad \mathrm{sign}(J^{\mp} \xi^{p+1}) = \mathrm{sign}\Big(2 - \sum_{i=1}^{N} J_i\Big) = +1. \tag{22}$$

Note that $J^{\pm}$ and $J^{\mp}$ both are balanced. This allows us to apply eqs. (19)-(22) recursively to obtain the lower bound

$$d_{VC}(N) \ge \tfrac{1}{2}(N + 2c - N_0), \qquad N \ge N_0 \tag{23}$$

for the general Ising perceptron, where $c$ is given by the cardinality of a shattered set for the balanced Ising perceptron with $N_0$ nodes.
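The recursion can be verified by brute force for small systems. The pattern definitions of eq. (19) are not legible in this copy, so the sketch below uses one choice that is consistent with eqs. (20)-(22): each old pattern is padded with +1 at both ends, and the new pattern is (-1, -1, ..., -1, +1). This choice, the starting set for N = 3 and all function names are assumptions made for illustration.

```python
from itertools import product


def balanced_outputs(patterns):
    """Output vectors realizable by Ising weights with sum(J) = +-1 (N odd)."""
    n = len(patterns[0])
    outputs = set()
    for J in product((-1, 1), repeat=n):
        if abs(sum(J)) != 1:
            continue  # keep balanced weight vectors only
        outputs.add(tuple(sum(j * x for j, x in zip(J, xi)) > 0 for xi in patterns))
    return outputs


def is_balanced_shattered(patterns):
    return len(balanced_outputs(patterns)) == 2 ** len(patterns)


def extend(patterns):
    """Go from N to N + 2 nodes and from p to p + 1 patterns (assumed form of eq. 19)."""
    n = len(patterns[0])
    old = [(1,) + tuple(xi) + (1,) for xi in patterns]  # equal end entries cancel for J+-, J-+
    new = (-1,) * (n + 1) + (1,)                        # gives fields -2 - sum(J) and 2 - sum(J)
    return old + [new]


start = [(1, 1, 1), (1, -1, 1)]   # hypothetical shattered set for the balanced N = 3 case
step5 = extend(start)             # 3 patterns, N = 5
step7 = extend(step5)             # 4 patterns, N = 7
print(is_balanced_shattered(start), is_balanced_shattered(step5), is_balanced_shattered(step7))
```

Every step adds two nodes and one pattern, which is the origin of the slope 1/2 in the bound (23).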

Now we are left with the problem of finding large shattered sets for the balanced Ising perceptron. A partial enumeration (see below) yields shattered sets with cardinality $c = 7, 11, 13$ for $N = 9, 15, 17$. This gives

$$\begin{aligned} d_{VC}(N) &\ge \tfrac{1}{2}(N + 5) & N &\ge 9 \\ d_{VC}(N) &\ge \tfrac{1}{2}(N + 7) & N &\ge 15 \\ d_{VC}(N) &\ge \tfrac{1}{2}(N + 9) & N &\ge 17 \end{aligned} \tag{24}$$

The corresponding shattered sets are listed in appendix [...]. This sequence of increasing lower bounds indicates that probably $\lim_{N\to\infty} d_{VC}(N)/N > \tfrac{1}{2}$.
There is a method that surely finds the largest possible shattered set, i.e. the exact value of $d_{VC}$: exhaustive enumeration of all shattered sets. The overwhelming complexity of $O(2^{N})$ limits this approach to small values of $N$, however. Nevertheless, the results obtained for $N \le 9$ are already quite remarkable [27]:

$$\begin{aligned} d_{VC}(3) &= 3 \\ d_{VC}(5) &= 4 \\ d_{VC}(7) &= 7 \\ d_{VC}(9) &= 7 \end{aligned} \tag{25}$$

Again the corresponding shattered sets are listed in appendix [...]. They share a common feature: Using transformations that do not change $\Delta(p)$ (see appendix [...]), they can be transformed into quasi orthogonal pattern sets, i.e. into sets where the patterns have minimum pairwise overlap (see footnote 2):

$$\xi^\mu \cdot \xi^\nu = \begin{cases} \pm 1 & \mu \ne \nu \\ N & \mu = \nu \end{cases} \tag{26}$$

Figure 6: Spherical perceptron with $N = 2$. The 4 cells in weight-space induced by patterns $\xi^1$ and $\xi^2$ are equisized for $\omega = \pi/2$, i.e. for orthogonal patterns.

This observation appears reasonable: Consider a shattered set of patterns. The corresponding cells in weight-space have non-zero volume $V(\{\sigma^\mu\})$, i.e. each cell contains at least one weight vector $J$. If we enlarge the shattered set by an additional pattern, each cell must divide in two cells of non-zero volume. This process can be repeated until the first non-divisible cell appears. If we assume that the divisibility of a cell decreases with its volume, we must look for cell structures where the volume of the smallest cell is maximized. This is the case for equisized cells, i.e. for orthogonal patterns (fig. 6).



Figure 7: VC dimension of the Ising perceptron with binary patterns vs. $N$. The circles labeled 'quasi orthogonal' and 'balanced' are lower bounds for the true $d_{VC}$. [Plot legend: exact, balanced, quasi orth., $N$, $(N+3)/2$; horizontal axis: $N$.]

Quasi orthogonal pattern sets can easily be built from the rows of Hadamard matrices (see appendix [...]). These are $4n \times 4n$ orthogonal matrices with $\pm 1$ entries. To get quasi orthogonal patterns of odd length $N$, we either cut out one column ($N = 4n - 1$) or add an arbitrary column ($N = 4n + 1$). It is clear that there are many quasi orthogonal pattern sets with $p$ elements that can be constructed from a given Hadamard matrix. By partial enumeration, i.e. by evaluation of some of them, we were able to find shattered sets that exceed the lower bound given by eq. (24) for certain values of $N$:

$$\begin{aligned} d_{VC}(N = 15) &\ge 13 \\ d_{VC}(N = 23) &\ge 17 \\ d_{VC}(N = 27) &\ge 19 \\ d_{VC}(N = 31) &\ge 24 \end{aligned} \tag{27}$$
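The column-cutting and column-adding step is easy to illustrate. The sketch below uses the Sylvester construction, which only yields Hadamard matrices of size $2^k$ and is chosen here for brevity; the paper does not say which Hadamard matrices were used, and the function names are hypothetical.

```python
def sylvester_hadamard(k):
    """Hadamard matrix of size 2^k x 2^k via the Sylvester doubling construction."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H


def pairwise_overlaps(patterns):
    """Set of scalar products between all pairs of distinct patterns."""
    return {
        sum(a * b for a, b in zip(p, q))
        for i, p in enumerate(patterns)
        for q in patterns[i + 1:]
    }


H = sylvester_hadamard(3)                 # 8 x 8, i.e. 4n with n = 2
cut = [row[1:] for row in H]              # drop one column: patterns of length N = 7
added = [row + [1] for row in H]          # append a column: patterns of length N = 9

print(pairwise_overlaps(H), pairwise_overlaps(cut), pairwise_overlaps(added))
```

Removing or adding one column changes each pairwise overlap from 0 to +1 or -1, which is the minimum possible magnitude for odd N. Whether a given subset of such rows is actually shattered still has to be checked by enumeration.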



2 Exact orthogonality cannot be achieved for N odd. 



The corresponding pattern sets are listed in appendix [...]. Systems with $N > 31$ were not investigated. Note that the lower bound for $d_{VC}(N = 31)$ is larger than the value reported in ref. [...].

Fig. 7 summarizes our results for $N \le 31$. Both the exact values and the lower bounds provided by eq. (24) and eq. (27) clearly exceed the maximum value $d_{VC} = \tfrac{1}{2}(N + 3)$ found by the statistical method in section 5. The somewhat irregular behaviour of the lower bounds does not rule out a more regular sequence of the true $d_{VC}(N)$, including well defined asymptotics. However, if the limit $\lim_{N\to\infty} d_{VC}/N$ exists, it will probably be larger than 0.5.
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6.2 Committee tree 



To get a lower bound for $d_{VC}^{CT}$, the VC-dimension of the Ising
committee tree with binary patterns, we explicitly construct
a shattered set based on shattered sets for the Ising perceptron.

Let $\{\tau^\nu\}$ be a shattered set of an Ising perceptron with $K$
nodes and $\{\xi^\mu\}$ be a shattered set of the $N/K$-node subperceptron.
Then we build patterns $\Xi^{\mu\nu}$ for the committee tree by
concatenation

$$\Xi^{\mu\nu} = (\tau_1^\nu \xi^\mu, \tau_2^\nu \xi^\mu, \ldots, \tau_K^\nu \xi^\mu). \qquad (28)$$

We prove that the set $\{\Xi^{\mu\nu}, 1 \le \mu \le d_{VC}(N/K), 1 \le \nu \le d_{VC}(K)\}$
is shattered.

Let $\{\sigma_{\mu\nu}\}$ be a given output sequence of length
$d_{VC}(K)\, d_{VC}(N/K)$. Since the $\{\tau^\nu\}$ are shattered, we can
always find $\{W_k^\mu = \pm 1\}_{k=1}^{K}$ such that

$$\sigma_{\mu\nu} = \mathrm{sign}\Big(\sum_{k=1}^{K} W_k^\mu \tau_k^\nu\Big) \qquad (29)$$

for all $\nu$ and $\mu$. Now we choose the couplings in the $k$-th
subperceptron such that

$$\mathrm{sign}(J^{(k)} \cdot \xi^\mu) = W_k^\mu, \qquad k = 1, \ldots, K. \qquad (30)$$

This is always possible because $\xi^\mu$ is taken from a shattered
set. Combining eqs. (29) and (30) we get

$$\sigma_{\mu\nu} = \mathrm{sign}\Big(\sum_{k=1}^{K} \mathrm{sign}\Big(\sum_{i=1}^{N/K} J_i^{(k)} \tau_k^\nu \xi_i^\mu\Big)\Big), \qquad \mu = 1, \ldots, d_{VC}(N/K), \quad \nu = 1, \ldots, d_{VC}(K), \qquad (31)$$

i.e. the patterns (28) form a shattered set and we find

$$d_{VC}^{CT}(N) \ge d_{VC}(K)\, d_{VC}(N/K). \qquad (32)$$

Note that this lower bound matches the upper bound $N$ whenever
$d_{VC}(K)$ and $d_{VC}(N/K)$ meet their upper bounds $K$ resp.
$N/K$. Examples: $K = 3$ or $K = 7$ and $N = 21$; $K = 7$ and
$N = 49$.

This lower bound is much larger than the values for
$d_{VC}^{typ}$ found above. If we assume that $\alpha_{VC} =
\lim_{N\to\infty} d_{VC}(N)/N$ is well defined for the Ising perceptron,
eq. (32) reads

$$d_{VC}^{CT}(N) \ge N \alpha_{VC}^2, \qquad N \gg K \gg 1. \qquad (33)$$

6.3 Parity tree

We follow the same strategy and construct a shattered set
from the patterns $\{\xi^\mu\}$ of a shattered set of the subperceptrons.
The first pattern is simply built from $K$ consecutive
patterns $\xi^1$:

$$\Xi^0 = (\xi^1, \ldots, \xi^1). \qquad (34)$$

All other patterns differ from $\Xi^0$ in only one subpattern:

$$\Xi^{k,\mu} = (\xi^1, \ldots, \xi^1, \underbrace{\xi^\mu}_{k\text{-th position}}, \xi^1, \ldots, \xi^1) \qquad (35)$$

where $k = 1, \ldots, K$ and $\mu = 2, \ldots, d_{VC}(N/K)$. This set of
$K(d_{VC}(N/K) - 1) + 1$ patterns is shattered.

Proof: Let $\{\sigma_0, \sigma_{k,\mu}\}$ be a given output sequence for our
patterns. We choose the weights $J^{(k)}$ in the subperceptrons
such that

$$\begin{aligned}
\mathrm{sign}(J^{(1)} \cdot \xi^1) &= \sigma_0 \\
\mathrm{sign}(J^{(k>1)} \cdot \xi^1) &= 1 \\
\mathrm{sign}(J^{(1)} \cdot \xi^\mu) &= \sigma_{1,\mu} \\
\mathrm{sign}(J^{(k>1)} \cdot \xi^\mu) &= \sigma_0\, \sigma_{k,\mu}
\end{aligned} \qquad (36)$$

for $\mu = 2 \ldots d_{VC}(N/K)$. This is always possible because
$\{\xi^\mu\}$ is shattered. With this assignment of weights, the parity tree
maps $\{\Xi^0, \Xi^{k,\mu}\}$ to the prescribed output sequence.

Our shattered set provides us with a lower bound for the
VC-dimension of the parity tree

$$d_{VC}^{PT}(N) \ge K\,(d_{VC}(N/K) - 1) + 1. \qquad (37)$$

For $K = 1$ the parity tree is equivalent to the simple perceptron
and eq. (37) reduces to $d_{VC}^{PT}(N) = d_{VC}(N)$. If one inserts
the lower bounds for $d_{VC}$ into the r.h.s. of eq. (37), the resulting
values are generally larger than $d_{VC}^{typ}$, but the differences
are much smaller than for the committee-tree, and for some
values of $N$ $d_{VC}^{typ}$ even exceeds the r.h.s. of eq. (37). We do not
know whether eq. (37) is only a bad lower bound or whether
the maximum shattered sets for the parity-tree are not as
atypical as for the Ising perceptron and the committee-tree.

7 Conclusions

The VC-dimension is one of the central quantities to characterize
the information processing abilities of feed-forward
neural networks. The determination of the VC-dimension of
a given network architecture is, however, in general a nontrivial
task.

In the present paper we have shown that even for the simplest
feed-forward neural networks this task requires rather
sophisticated techniques if both the couplings of the network
and the inputs are restricted to binary values ±1. This is
mainly due to the fact that the VC-dimension, defined by a
supremum over all pattern sets of given size, is determined by
atypical pattern sets. Consequently Monte-Carlo methods as
well as analytical estimates involving pattern averages do not
yield reliable results and one has to resort to exact enumeration
techniques. These methods are naturally restricted to
small dimensions of the input space but the results obtained
can be used to get lower bounds for the VC-dimension of
larger systems. In some cases, even larger lower bounds can be
derived from number theoretic arguments.

Complementarily, one could argue that typical situations are
of more interest than the worst case. Accordingly a typical
VC-dimension $d_{VC}^{typ}$ has been defined above. One always
has $d_{VC}^{typ} \le d_{VC}$ since an average can never be larger than the
supremum.

For the Ising perceptron ($J_i = \pm 1$) we found $d_{VC} = N$
as long as the patterns are allowed to take on the value 0,
no matter whether we use real-valued or $\{0, 1\}$-patterns. If,
however, the patterns are also Ising-like, i.e. $\xi \in \{\pm 1\}^N$, our
numerical results suggest

$$\frac{1}{2}(N + 3) \le d_{VC}(N) \le N \qquad (38)$$

for general $N$. For large $N$, the VC-dimension is presumably
substantially larger than the typical VC-dimension $d_{VC}^{typ}(N) \propto
N/2$.

Similar results are found for two simple examples of multi-layer
networks, the committee- and the parity-tree with Ising
couplings. Here the results are

$$d_{VC}(K)\, d_{VC}(N/K) \le d_{VC}^{CT}(N) \le N \qquad (39)$$

for the committee- and

$$K\,(d_{VC}(N/K) - 1) + 1 \le d_{VC}^{PT}(N) \le N \qquad (40)$$

for the parity-tree with $K$ hidden nodes. For the committee-tree
we find again that $d_{VC}^{typ} < d_{VC}$. For the parity-tree, our
data do not allow us to draw the same conclusion, but this
may be due to the low quality of the lower bound in eq. (40).
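The lower bounds in eqs. (39) and (40) rest on explicit pattern constructions, which can be checked by brute force in a very small case. The following sketch is not part of the original analysis: it assumes $K = 3$ subperceptrons with $N/K = 3$ inputs each, uses the three-pattern shattered set for $N = 3$ as both $\{\tau^\nu\}$ and $\{\xi^\mu\}$, and all function names are hypothetical. It builds the concatenated patterns and counts the output sequences realized by all $2^9$ Ising weight vectors.

```python
from itertools import product

def sign(x):
    return 1 if x > 0 else -1

# Shattered set of the Ising perceptron with 3 inputs (assumed here,
# taken as the N = 3 example set).
BASE = [(1, 1, 1), (-1, -1, 1), (-1, 1, -1)]

def committee_patterns(tau, xi):
    """Committee-tree patterns: block k of pattern (mu, nu) is tau[nu][k] * xi[mu]."""
    K = len(tau[0])
    return [tuple(t[k] * x for k in range(K) for x in m) for m in xi for t in tau]

def parity_patterns(xi, K):
    """Parity-tree patterns: K copies of xi[0], with at most one block replaced."""
    pats = [tuple(xi[0]) * K]
    for k in range(K):
        for mu in range(1, len(xi)):
            blocks = [xi[0]] * K
            blocks[k] = xi[mu]
            pats.append(tuple(x for b in blocks for x in b))
    return pats

def hidden(J, pattern, K):
    """Outputs of the K subperceptrons; block k only sees its own inputs."""
    n = len(pattern) // K
    return [sign(sum(J[k * n + i] * pattern[k * n + i] for i in range(n)))
            for k in range(K)]

def majority(h):
    return sign(sum(h))

def parity(h):
    out = 1
    for s in h:
        out *= s
    return out

def count_dichotomies(patterns, K, decoder):
    """Number of distinct output sequences over all Ising weight vectors."""
    seen = set()
    for J in product((-1, 1), repeat=len(patterns[0])):
        seen.add(tuple(decoder(hidden(J, p, K)) for p in patterns))
    return len(seen)

ct = committee_patterns(BASE, BASE)   # 3 * 3 = 9 patterns of length 9
pt = parity_patterns(BASE, 3)         # 3 * (3 - 1) + 1 = 7 patterns of length 9
ct_count = count_dichotomies(ct, 3, majority)
pt_count = count_dichotomies(pt, 3, parity)
print(ct_count == 2 ** len(ct), pt_count == 2 ** len(pt))
```

A set of $p$ patterns is shattered exactly when the count equals $2^p$, so the two comparisons test the committee-tree and the parity-tree construction, respectively.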

We finally note that the growth function $\Delta(p)$ related
to the VC-dimension is used to derive the famous Vapnik-Chervonenkis
bound for the asymptotic difference between
learning and generalization error. This bound results from
the analysis of the worst case. It would be interesting to investigate
whether a similar bound for the typical generalization
behaviour could be obtained from $\Delta^{typ}(p)$, which in general is
much easier to determine.



A Symmetries 

Let $\{\xi^1, \ldots, \xi^p\}$ be a set of binary ±1 patterns and
$\Delta(\xi^1, \ldots, \xi^p)$ the number of different output sequences
$(\sigma_1, \ldots, \sigma_p)$ that can be realized by the Ising-perceptron for
this particular set of patterns.

$\Delta(\xi^1, \ldots, \xi^p)$ is invariant under the following transformations
on $\{\xi^1, \ldots, \xi^p\}$:

• complement a whole pattern: $\xi^\mu \mapsto -\xi^\mu$

• interchange two patterns: $\xi^\mu \leftrightarrow \xi^\nu$

• complement one entry in all patterns ($\mu = 1 \ldots p$): $\xi_i^\mu \mapsto -\xi_i^\mu$

• interchange two entries in all patterns ($\mu = 1 \ldots p$): $\xi_i^\mu \leftrightarrow \xi_j^\mu$

Applying these transformations, we can always achieve that
all patterns have $\xi_N^\mu = -1$.
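The four invariances can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the pattern set is an arbitrary hypothetical example with $N = 5$, the helper names are made up for this purpose, and the number of realizable output sequences is counted by enumerating all $2^N$ Ising weight vectors without threshold.

```python
from itertools import product

def count_dichotomies(patterns):
    """Distinct output sequences of the Ising perceptron (no threshold)."""
    n = len(patterns[0])
    seen = set()
    for J in product((-1, 1), repeat=n):
        seen.add(tuple(1 if sum(j * x for j, x in zip(J, p)) > 0 else -1
                       for p in patterns))
    return len(seen)

def complement_pattern(ps, mu):
    return [tuple(-x for x in p) if m == mu else p for m, p in enumerate(ps)]

def swap_patterns(ps, mu, nu):
    ps = list(ps)
    ps[mu], ps[nu] = ps[nu], ps[mu]
    return ps

def complement_entry(ps, i):
    return [tuple(-x if k == i else x for k, x in enumerate(p)) for p in ps]

def swap_entries(ps, i, j):
    out = []
    for p in ps:
        q = list(p)
        q[i], q[j] = q[j], q[i]
        out.append(tuple(q))
    return out

def normalize_last_entry(ps):
    """Complement whole patterns until every pattern ends in -1."""
    return [tuple(-x for x in p) if p[-1] == 1 else p for p in ps]

# Hypothetical example set; N = 5 is odd, so the local field never vanishes.
patterns = [(1, 1, 1, 1, 1), (-1, 1, -1, -1, 1),
            (-1, 1, -1, 1, -1), (1, -1, 1, 1, -1)]

base = count_dichotomies(patterns)
variants = [complement_pattern(patterns, 1), swap_patterns(patterns, 0, 2),
            complement_entry(patterns, 4), swap_entries(patterns, 0, 3)]
counts = [count_dichotomies(v) for v in variants]
normalized = normalize_last_entry(patterns)
print(base, counts, count_dichotomies(normalized))
```

All printed counts agree, and `normalize_last_entry` shows one way to reach the normal form in which every pattern has its last entry equal to -1.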

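Now we assume that $N$, the number of couplings, is odd
and that there is no threshold. Let $J$ be a weight vector that
realizes an output sequence $(\sigma_1, \ldots, \sigma_p)$ for patterns with
$\xi_N^\nu = -1$:

$$\sigma_\nu = \mathrm{sign}\Big(\sum_{i=1}^{N} J_i \xi_i^\nu\Big) = \mathrm{sign}\Big(\sum_{i=1}^{N-1} J_i \xi_i^\nu - J_N\Big), \qquad 1 \le \nu \le p. \qquad (41)$$

We use the left $N - 1$ bits of $\xi^\nu$ as a pattern set for the Ising-perceptron
with $N - 1$ input units and a binary threshold $\Theta$.
Identifying $\Theta$ with $J_N$ in eq. (41) it becomes obvious that
$\Delta(\xi^1, \ldots, \xi^p)$ is the same in both cases. Hence we may restrict
ourselves to the case $N$ odd and no threshold to discuss
the VC-dimension of the Ising-perceptron.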

Now we consider a 2 layer feedforward network with $K$
perceptrons (spherical or Ising) operating between input and
hidden layer (weight vectors $J^{(k)}$) and an Ising perceptron as
decoder function with weight vector $J^{0}$. Suppose that a
given output sequence is realized by a weight vector with
some entries $J_k^{0} = -1$ in the decoder perceptron. The output
sequence is left unchanged if we set $J_k^{0} = +1$ and at the same
time complement all weights in the $k$-th subperceptron,
$J^{(k)} \mapsto -J^{(k)}$. This transformation allows us to realize any realizable
output sequence with all $J_k^{0} = +1$. Hence the VC-dimension
of the committee machine equals the VC-dimension of the
two layer perceptron with Ising weights in the output layer.



B Hadamard matrices 

A Hadamard matrix is an $m \times m$ matrix $H$ with ±1 entries
such that

$$H H^T = m I \qquad (42)$$

where $I$ is the $m \times m$ identity matrix.

If $H$ is an $m \times m$ Hadamard matrix, then $m = 1$, $m = 2$
or $m \equiv 0 \bmod 4$. The converse is a famous open question: Is
there a Hadamard matrix of order $m = 4n$ for every positive
$n$? The first open case is $m = 428$.

If $H$ and $H'$ are Hadamard matrices of order $m$ resp. $m'$,
their Kronecker product $H \otimes H'$ is a Hadamard matrix of
order $mm'$. Starting with the 2 x 2 Hadamard matrix

$$H_2 = \begin{pmatrix} -1 & -1 \\ -1 & +1 \end{pmatrix} \qquad (43)$$

this gives Hadamard matrices of order 4, 8, 16, ..., $2^n$, the so
called Sylvester type matrices. Example:

$$H_{2^3} = \begin{pmatrix}
-1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 \\
-1 & +1 & -1 & +1 & -1 & +1 & -1 & +1 \\
-1 & -1 & +1 & +1 & -1 & -1 & +1 & +1 \\
-1 & +1 & +1 & -1 & -1 & +1 & +1 & -1 \\
-1 & -1 & -1 & -1 & +1 & +1 & +1 & +1 \\
-1 & +1 & -1 & +1 & +1 & -1 & +1 & -1 \\
-1 & -1 & +1 & +1 & +1 & +1 & -1 & -1 \\
-1 & +1 & +1 & -1 & +1 & -1 & -1 & +1
\end{pmatrix} \qquad (44)$$
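The Kronecker construction is easy to reproduce. The following is a minimal sketch in plain Python, not taken from the paper; it assumes the 2 x 2 seed matrix with rows (-1, -1) and (-1, +1), for which the third Kronecker power reproduces the first rows of the order-8 example. The function names `kronecker`, `sylvester` and `is_hadamard` are hypothetical.

```python
def kronecker(A, B):
    """Kronecker product of two square matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def is_hadamard(H):
    """Check that H is square, has entries +-1 and satisfies H H^T = m I."""
    m = len(H)
    if any(len(r) != m or any(x not in (-1, 1) for x in r) for r in H):
        return False
    return all(sum(a * b for a, b in zip(H[i], H[j])) == (m if i == j else 0)
               for i in range(m) for j in range(m))

H2 = [[-1, -1], [-1, 1]]   # assumed seed matrix

def sylvester(n):
    """Sylvester type matrix of order 2**n as an n-fold Kronecker power of H2."""
    H = H2
    for _ in range(n - 1):
        H = kronecker(H2, H)
    return H

H8 = sylvester(3)
for row in H8:
    print(" ".join("+1" if x == 1 else "-1" for x in row))
print(all(is_hadamard(sylvester(n)) for n in range(1, 5)))
```

Because the sign of the first row alternates with the number of factors, a different but equivalent convention (for instance building each order from blocks H, H, H, -H) yields matrices that differ from these only by complementing rows or columns.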



Let $q$ be an odd prime power. Then Hadamard matrices of
Paley type can be constructed for

$$m = \begin{cases} q + 1 & \text{for } q \equiv 3 \bmod 4 \\ 2(q + 1) & \text{for } q \equiv 1 \bmod 4. \end{cases} \qquad (45)$$
Paley's construction p8| relies on the properties of finite
Galois fields GF($q$) p^ , where $q$ is an odd prime power,
especially on the quadratic character $\chi$ of GF($q$), defined by

$$\chi(x) = \begin{cases} 0 & \text{if } x = 0 \\ +1 & \text{if } x \ne 0 \text{ is a square} \\ -1 & \text{otherwise.} \end{cases} \qquad (46)$$

Then for any $a \ne 0$

$$\sum_{x \in GF(q)} \chi(x)\, \chi(x - a) = -1. \qquad (47)$$
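The character sum can be checked numerically for small fields. The sketch below is restricted to prime $q$, where GF($q$) is simply the integers modulo $q$ (prime powers such as $q = 9$ would need genuine finite-field arithmetic, which is not attempted here). It evaluates the quadratic character via Euler's criterion; the function names are hypothetical.

```python
def chi(x, q):
    """Quadratic character of GF(q) for an odd prime q."""
    x %= q
    if x == 0:
        return 0
    # Euler's criterion: x is a nonzero square iff x**((q-1)/2) == 1 (mod q).
    return 1 if pow(x, (q - 1) // 2, q) == 1 else -1

def character_sum(a, q):
    """Sum over x in GF(q) of chi(x) * chi(x - a)."""
    return sum(chi(x, q) * chi(x - a, q) for x in range(q))

for q in (3, 5, 7, 11, 13):
    print(q, [character_sum(a, q) for a in range(1, q)])
```

Every entry of every printed list equals -1, independent of the shift $a \ne 0$.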



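To construct a Paley-type matrix for $q \equiv 3 \bmod 4$, we start
with the $q \times q$ matrix $M = (m_{ij})$ whose rows and columns
are indexed by the elements of GF($q$):

$$m_{ij} = \begin{cases} -1 & \text{if } i = j \\ \chi(j - i) & \text{if } i \ne j. \end{cases} \qquad (48)$$

Hence by eq. (47)

$$\sum_{j \in GF(q)} m_{hj}\, m_{ij} = \begin{cases} q & h = i \\ -1 & h \ne i. \end{cases} \qquad (49)$$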



Now adjoin one row and one column with all entries +1 to 
get a Hadamard matrix of order q + 1. This gives Hadamard 
matrices of order 4, 8, 12, 20, 24, 28, ... . 

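Example: $q = 11$. The Galois field GF(11) is equivalent
to the integers $\{0, \ldots, 10\}$ together with their addition and
multiplication modulo 11. The squares are 1, 4, 9, 5, 3 and we
get

$$H_{11+1} = \begin{pmatrix}
+1 & +1 & +1 & +1 & +1 & +1 & +1 & +1 & +1 & +1 & +1 & +1 \\
+1 & -1 & +1 & -1 & +1 & +1 & +1 & -1 & -1 & -1 & +1 & -1 \\
+1 & -1 & -1 & +1 & -1 & +1 & +1 & +1 & -1 & -1 & -1 & +1 \\
+1 & +1 & -1 & -1 & +1 & -1 & +1 & +1 & +1 & -1 & -1 & -1 \\
+1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 & +1 & +1 & -1 & -1 \\
+1 & -1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 & +1 & +1 & -1 \\
+1 & -1 & -1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 & +1 & +1 \\
+1 & +1 & -1 & -1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 & +1 \\
+1 & +1 & +1 & -1 & -1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 \\
+1 & +1 & +1 & +1 & -1 & -1 & -1 & +1 & -1 & -1 & +1 & -1 \\
+1 & -1 & +1 & +1 & +1 & -1 & -1 & -1 & +1 & -1 & -1 & +1 \\
+1 & +1 & -1 & +1 & +1 & +1 & -1 & -1 & -1 & +1 & -1 & -1
\end{pmatrix} \qquad (50)$$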
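The construction for $q \equiv 3 \bmod 4$ can be sketched in a few lines. The code below is an illustration, not the authors' program; it handles prime $q$ only (so that GF($q$) is arithmetic modulo $q$), and the names `paley_type_one` and `is_hadamard` are hypothetical.

```python
def chi(x, q):
    """Quadratic character of GF(q) for an odd prime q (Euler's criterion)."""
    x %= q
    if x == 0:
        return 0
    return 1 if pow(x, (q - 1) // 2, q) == 1 else -1

def paley_type_one(q):
    """Hadamard matrix of order q + 1 for a prime q with q % 4 == 3."""
    if q % 4 != 3:
        raise ValueError("this sketch requires a prime q with q % 4 == 3")
    # Core matrix: -1 on the diagonal, chi(j - i) elsewhere.
    core = [[-1 if i == j else chi(j - i, q) for j in range(q)]
            for i in range(q)]
    # Adjoin one row and one column with all entries +1.
    return [[1] * (q + 1)] + [[1] + row for row in core]

def is_hadamard(H):
    m = len(H)
    return all(sum(a * b for a, b in zip(H[i], H[j])) == (m if i == j else 0)
               for i in range(m) for j in range(m))

H12 = paley_type_one(11)
for row in H12:
    print(" ".join("+1" if x == 1 else "-1" for x in row))
print(is_hadamard(H12))
```

For $q = 11$ the core is circulant: each row is the previous one shifted cyclically by one position, which is the pattern visible in the order-12 example.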



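For $q \equiv 1 \bmod 4$, the construction starts with the $(q+1) \times
(q+1)$ matrix $M = (m_{ij})$, indexed by GF($q$) $\cup \{\infty\}$ as follows:

$$\begin{aligned}
m_{\infty j} = m_{j \infty} &= 1 \quad \text{for all } j \in GF(q) \\
m_{\infty\infty} &= 0 \\
m_{ij} &= \chi(j - i) \quad \text{for } i, j \in GF(q).
\end{aligned} \qquad (51)$$

$M$ is symmetric and orthogonal. To get from $M$ to a
Hadamard matrix of order $2(q + 1)$, we define the auxiliary
matrices $A$ and $B$ by

$$A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & -1 \\ -1 & -1 \end{pmatrix} \qquad (52)$$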



and replace every 0 in $M$ by $B$, every $+1$ by $A$ and every
$-1$ by $-A$. This gives Hadamard matrices of order
12, 20, 28, 36, 52, .... Example: $q = 5$. GF(5) is equivalent
to the integers $\{0, \ldots, 4\}$ and their addition and multiplication
modulo 5. The squares are 1 and 4 and we get

$$H_{2(5+1)} = \begin{pmatrix}
B & A & A & A & A & A \\
A & B & A & -A & -A & A \\
A & A & B & A & -A & -A \\
A & -A & A & B & A & -A \\
A & -A & -A & A & B & A \\
A & A & -A & -A & A & B
\end{pmatrix} \qquad (53)$$
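The block substitution can likewise be sketched for prime $q \equiv 1 \bmod 4$. This is an illustrative sketch rather than the original procedure: it takes the diagonal of $M$ (including the corner entry) to be 0, uses the 2 x 2 matrices $A$ with rows (1, 1), (1, -1) and $B$ with rows (1, -1), (-1, -1), and the function names are hypothetical.

```python
def chi(x, q):
    """Quadratic character of GF(q) for an odd prime q (Euler's criterion)."""
    x %= q
    if x == 0:
        return 0
    return 1 if pow(x, (q - 1) // 2, q) == 1 else -1

A = [[1, 1], [1, -1]]
B = [[1, -1], [-1, -1]]

def paley_type_two(q):
    """Hadamard matrix of order 2(q + 1) for a prime q with q % 4 == 1."""
    if q % 4 != 1:
        raise ValueError("this sketch requires a prime q with q % 4 == 1")
    # Index 0 plays the role of infinity; indices 1..q are the field elements.
    M = [[0] + [1] * q] + [[1] + [chi(j - i, q) for j in range(q)]
                           for i in range(q)]
    H = []
    for row in M:
        top, bottom = [], []
        for m in row:
            # 0 -> B, +1 -> A, -1 -> -A
            block = B if m == 0 else [[m * a for a in r] for r in A]
            top += block[0]
            bottom += block[1]
        H += [top, bottom]
    return H

def is_hadamard(H):
    m = len(H)
    return all(sum(a * b for a, b in zip(H[i], H[j])) == (m if i == j else 0)
               for i in range(m) for j in range(m))

H12 = paley_type_two(5)
print(len(H12), is_hadamard(H12))
```

The substitution works because $M$ is symmetric with $M M^T = qI$, while $A A^T = B B^T = 2I$ and $A B^T + B A^T = 0$, so the cross terms cancel and the product is $2(q+1)I$.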



The first value of $m = 4n$ where neither the Sylvester- nor
the Paley-construction applies is $m = 92$.

C Gallery of shattered sets 

For $N \le 9$ the exact values of $d_{VC}$ have been obtained by
exhaustive enumerations. Shattered sets of maximum cardinality
are:

N = 3    N = 5    N = 7      N = 9

+++      +++++    -------    ---------
--+      -+--+    -+-+-+-    +-+-+-+-+
-+-      -+-+-    --++--+    ---++--++
         -++--    -++--++    +-++--++-
                  ----+++    -----++++
                  -+-++-+    +-+-++-+-
                  --++++-    ---++++--

The sets for $N = 3$ and $N = 5$ are obtained from the rows
of the Sylvester type Hadamard matrix $H_{2^2}$. For $N = 3$, the
first column and the last row have been deleted. For $N = 5$,
a column $(+1, -1, -1, -1)$ has been adjoined. The sets for
$N = 7$ and $N = 9$ are obtained from the rows of the Sylvester
type Hadamard matrix $H_{2^3}$, cf. eq. (44). For $N = 7$,
the 8th column and row have been deleted, and for $N = 9$, a
column with alternating ±1's has been adjoined.
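Whether a small pattern set is shattered by the Ising perceptron can be verified by direct enumeration of all $2^N$ weight vectors. The sketch below does this for the $N = 3$ and $N = 5$ sets as read from the gallery above; the pattern strings are a transcription (with `+` for +1 and `-` for -1) and the helper names are hypothetical.

```python
from itertools import product

def parse(rows):
    """Turn strings of '+' and '-' into tuples of +1 and -1."""
    return [tuple(1 if c == "+" else -1 for c in r) for r in rows]

def dichotomies(patterns):
    """Set of output sequences realized by Ising weights without threshold."""
    n = len(patterns[0])
    return {tuple(1 if sum(j * x for j, x in zip(J, p)) > 0 else -1
                  for p in patterns)
            for J in product((-1, 1), repeat=n)}

def is_shattered(patterns):
    return len(dichotomies(patterns)) == 2 ** len(patterns)

SETS = {
    3: parse(["+++", "--+", "-+-"]),
    5: parse(["+++++", "-+--+", "-+-+-", "-++--"]),
}

for n, pats in SETS.items():
    print(n, len(pats), is_shattered(pats))
```

For $N = 3$ all 8 weight vectors produce different output sequences, while for $N = 5$ the 32 weight vectors collapse onto exactly 16 distinct sequences, which is what shattering four patterns requires. The same enumeration works for larger $N$ but its cost grows as $2^N$.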

The largest shattered sets we could find for the balanced
Ising perceptron with binary patterns are:



N = 9 


N =15 




N = 17 




-+-+-+-+-+-+-+- 




-+-+-+-+-+-+-+-+ 


+++ 


— ++ — ++ — ++ — + 


+ 


— ++ — ++ — ++ — ++ 


++ — + 


-++ — ++ — ++ — ++ 




-++ — ++ — ++ — ++- 


— ++ + 


++++ +++ 


+ 


++++ ++++ 


-+-+-+-+- 


-+-++-+ — +-++-+ 




-+-++-+ — +-++-+- 


-+-++-+ — 


— ++++ ++++- 


+ 


— ++++ ++++ — 


-++ — ++ — 


+++++++ 




-++-+ — +-++-+ — + 




-+-+-+-++-+-+-+ 


+ 


++++++++ 




— ++ — ++++ — ++- 




-+-+-+-++-+-+-+- 




-+-++-+-+-+ — +- 


+ 


— ++ — ++++ — ++ — 




— ++++ — ++ + 




-++ — ++-+ — ++ — + 






+ 


++++++++ 








-+-++-+-+-+ — +-+ 



These pattern sets lead to the lower bounds in eq. (24). The
set for $N = 9$ has been found by exhaustive enumeration and
has no simple relation to a Hadamard matrix. The patterns
for $N = 15$ are rows 2-11, 14 and 15 of the Sylvester type
Hadamard matrix $H_{2^4}$ with the last column deleted. The
patterns for $N = 17$ are rows 2-14 of $H_{2^4}$, extended by a
column of alternating ±1's.

Pattern sets that exceed the bounds given in eq. (24) can
be constructed for these values of $N$:

N = 15: Delete the last column from the Sylvester type
Hadamard matrix $H_{2^4}$; then the first 13 rows form a
shattered pattern set.

N = 23: Delete the last column from the Hadamard matrix
$H_2 \otimes H_{11+1}$; then the first 17 rows form a shattered
pattern set.

N = 27: Delete the last column from the Paley type
Hadamard matrix $H_{2(13+1)}$; then the first 19 rows form
a shattered pattern set.

N = 31: Delete the last column from the Sylvester type
Hadamard matrix $H_{2^5}$; then the rows number 2 to number
25 form a shattered pattern set with 24 patterns.